WIP: net, infra, stuntime: Measure connectivity gap during VM live migration #4238
Anatw wants to merge 2 commits into RedHatQE:main
Introduce the STD for VM stuntime measurement during live migration on Linux bridge and OVN localnet secondary networks, for both IPv4 and IPv6. These tests provide baseline measurements and a framework for regression detection.

Baseline and Threshold:
- Per-scenario threshold: set the threshold to the minimum of (max_observed × 4) and 5 seconds, i.e. 4× the worst observed value for that scenario, capped at 5 seconds.
- Thresholds will be hardcoded based on 10-run baselines per scenario, on a BM cluster.
- A per-scenario approach is used instead of a global threshold to prevent slow-path scenarios from masking regressions in faster ones.

Measurement Methodology:
- Ping command: ICMP ping at 100ms intervals with UNIX timestamps (`ping -D -O -i 0.1` for IPv4, `ping -6 -D -O -i 0.1` for IPv6).
- Stuntime calculation: Stuntime is the largest gap between any two consecutive successful replies in the ping log, i.e., the maximum time difference between timestamps of successive successful packets. The boundaries (last success before loss, first success after recovery) are the pair of packets that define that gap.
  Stuntime = (Timestamp of first success after) - (Timestamp of last success before)
- Alternatives rejected: tcping (introduces a new dependency), iperf3 (unnecessary complexity), and curl (unnecessary complexity; requires an active web server inside the VM).

IP Family:
- IPv4 and IPv6 are measured in separate migrations to avoid interactions between ARP (IPv4) and NDP (IPv6) recovery paths.
- ip_family is a parametrize dimension with pytest.mark.ipv4 and pytest.mark.ipv6 applied per value, allowing selective runs: IPv4-only (`pytest -m ipv4`), IPv6-only (`pytest -m ipv6`), or both (no `-m` flag needed).
- Total scenarios: 24 (2 CNI types × 3 migration paths × 2 ping initiators × 2 IP families).

Both VMs start on the same node to align with the first scenario logic: migrating from the same node to a different node (static_to_different).
Signed-off-by: Anat Wax <awax@redhat.com>
Assisted by: Cursor
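The per-scenario threshold rule from the commit message (4× the worst observed baseline, capped at 5 seconds) can be sketched as follows. This is only an illustration: the function name `scenario_threshold`, its parameters, and the sample baseline values are assumptions and are not taken from the PR.

```python
def scenario_threshold(baseline_stuntimes, multiplier=4.0, cap_seconds=5.0):
    """Return min(max_observed * multiplier, cap_seconds) for one scenario.

    `baseline_stuntimes` is the list of stuntime values (in seconds) measured
    for a single scenario, e.g. from 10 baseline runs.
    """
    if not baseline_stuntimes:
        raise ValueError("at least one baseline measurement is required")
    return min(max(baseline_stuntimes) * multiplier, cap_seconds)


if __name__ == "__main__":
    # Hypothetical baselines, not real measurements.
    fast_scenario = [0.25, 0.5, 0.375]
    slow_scenario = [1.5, 2.0, 1.75]
    print(scenario_threshold(fast_scenario))  # 4 x 0.5 stays below the cap
    print(scenario_threshold(slow_scenario))  # 4 x 2.0 exceeds 5 s, so the cap applies
```

Keeping the threshold per scenario means a regression in a fast scenario is still caught even when a slower scenario already sits near the 5-second cap.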
Quantify downtime between pings for regression detection and baselines.
Signed-off-by: Anat Wax <awax@redhat.com>
Assisted by: Cursor
Walkthrough
Two new files were added for VM stuntime measurement during live migration: a helper module providing the stuntime computation logic, which parses ping output and calculates the maximum gap between consecutive timestamps, and a test module containing test class skeletons for the Linux bridge and OVN localnet network types.
azhivovk left a comment
I'm not sure we can merge this helper without any usage; it might also fail tox for unused code.
"""Parse ping -D output and compute stuntime as the largest gap between successful replies.

Stuntime is the connectivity gap duration: the largest interval where no ICMP replies
were received. For example, with ping at 0.1s intervals, any gap > 0.1s indicates packet loss.
"For example, with ping at 0.1s intervals, any gap > 0.1s indicates packet loss."
Please reconsider whether this sentence belongs here.
Why?
- It is supposed to serve as an example for the preceding sentence, which explains what stuntime means, but in fact it explains response intervals.
- I am not sure it is true. AFAIK, ICMP responses may be delayed regardless of the interval between the echo requests, yet still arrive eventually, and this is not considered packet loss.
Stuntime in seconds (float).

Raises:
    InsufficientStuntimeDataError: When ping log has fewer than 2 reply timestamps.
I think this explanation is not clear. I had to return to the class docstring to understand why fewer than 2 timestamps is a problem.
Preconditions:
    - Running VM for migration on Linux bridge secondary network, running on worker1.
    - Running peer VM on Linux bridge secondary network, running on worker1.
The class is defined as parameterized, but the preconditions describe one specific scenario from the matrix, in which both VMs are scheduled on the same node (the co_located_to_remote scenario).
    - Running VM for migration on OVN localnet secondary network, running on worker1.
    - Running peer VM on OVN localnet secondary network, running on worker1.
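To make the size of the parameter matrix concrete (2 CNI types × 3 migration paths × 2 ping initiators × 2 IP families = 24 scenarios, as stated in the commit message), here is a rough sketch using only the standard library. Apart from `static_to_different` and the two ping invocations, every name below (the other migration path labels, the initiator labels, the `ping_command` helper, and the sample address) is a hypothetical placeholder and not taken from the PR.

```python
from itertools import product

CNI_TYPES = ("linux_bridge", "ovn_localnet")
# Only "static_to_different" is named in the commit message; the other two
# labels are placeholders for the remaining migration paths.
MIGRATION_PATHS = ("static_to_different", "path_placeholder_2", "path_placeholder_3")
# Placeholder labels for the two ping initiators.
PING_INITIATORS = ("initiator_placeholder_a", "initiator_placeholder_b")
IP_FAMILIES = ("ipv4", "ipv6")

SCENARIOS = list(product(CNI_TYPES, MIGRATION_PATHS, PING_INITIATORS, IP_FAMILIES))


def ping_command(ip_family, target_ip):
    """Build the ping argument list described in the commit message.

    -D prints a UNIX timestamp per line, -O reports outstanding replies,
    -i 0.1 sends one echo request every 100 ms; -6 selects IPv6.
    """
    args = ["ping"]
    if ip_family == "ipv6":
        args.append("-6")
    args += ["-D", "-O", "-i", "0.1", target_ip]
    return args


if __name__ == "__main__":
    print(len(SCENARIOS))
    print(" ".join(ping_command("ipv4", "172.16.2.2")))
```

In the actual tests the IP family would be a pytest parametrize dimension carrying the `ipv4` / `ipv6` marks, so the sketch above only illustrates how the dimensions multiply.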
/wip
Short description:
Quantify downtime between pings for regression detection and baselines.
What this PR does / why we need it:
Add compute_stuntime helper to parse ping -D output and compute connectivity gap. To be used in the measurement scenarios.
Special notes for reviewer:
Based on the stuntime STP, not yet merged.
Verification
Output:
...
[1774185115.382314] 64 bytes from 172.16.2.2: icmp_seq=41 ttl=64 time=0.695 ms
[1774185115.486496] 64 bytes from 172.16.2.2: icmp_seq=42 ttl=64 time=0.785 ms
[1774185115.590352] 64 bytes from 172.16.2.2: icmp_seq=43 ttl=64 time=0.641 ms
[1774185115.694331] 64 bytes from 172.16.2.2: icmp_seq=44 ttl=64 time=0.676 ms
[1774185116.630529] 64 bytes from 172.16.2.2: icmp_seq=53 ttl=64 time=0.859 ms
[1774185126.824966] 64 bytes from 172.16.2.2: icmp_seq=151 ttl=64 time=3.22 ms
[1774185126.923577] 64 bytes from 172.16.2.2: icmp_seq=152 ttl=64 time=1.11 ms
[1774185127.023927] 64 bytes from 172.16.2.2: icmp_seq=153 ttl=64 time=0.907 ms
[1774185127.124168] 64 bytes from 172.16.2.2: icmp_seq=154 ttl=64 time=0.781 ms
[1774185127.230467] 64 bytes from 172.16.2.2: icmp_seq=155 ttl=64 time=0.719 ms
...
Measured stuntime: 10.194s
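For readers who want to reproduce the number above, the calculation can be sketched as follows. This is a minimal illustration, not the helper from the PR: the names `compute_stuntime` and `InsufficientStuntimeDataError` come from the PR, but the regular expression, the function body, and the error message are assumptions.

```python
import re

# Matches successful replies in `ping -D` output, for example:
# "[1774185115.382314] 64 bytes from 172.16.2.2: icmp_seq=41 ttl=64 time=0.695 ms"
# Lines without "bytes from" (such as "no answer yet" lines printed by -O) are skipped.
REPLY_RE = re.compile(r"^\[(\d+\.\d+)\] \d+ bytes from ")


class InsufficientStuntimeDataError(Exception):
    """Raised when fewer than two replies exist, so no gap can be computed."""


def compute_stuntime(ping_output):
    """Return the largest gap (seconds) between consecutive successful replies."""
    timestamps = [
        float(match.group(1))
        for line in ping_output.splitlines()
        if (match := REPLY_RE.match(line.strip()))
    ]
    if len(timestamps) < 2:
        # A gap needs two endpoints: the last reply before and the first reply after.
        raise InsufficientStuntimeDataError(
            f"need at least 2 reply timestamps, got {len(timestamps)}"
        )
    return max(later - earlier for earlier, later in zip(timestamps, timestamps[1:]))


SAMPLE = """\
[1774185115.694331] 64 bytes from 172.16.2.2: icmp_seq=44 ttl=64 time=0.676 ms
[1774185116.630529] 64 bytes from 172.16.2.2: icmp_seq=53 ttl=64 time=0.859 ms
[1774185126.824966] 64 bytes from 172.16.2.2: icmp_seq=151 ttl=64 time=3.22 ms
[1774185126.923577] 64 bytes from 172.16.2.2: icmp_seq=152 ttl=64 time=1.11 ms
"""

if __name__ == "__main__":
    print(f"Measured stuntime: {compute_stuntime(SAMPLE):.3f}s")  # Measured stuntime: 10.194s
```

The largest gap in the excerpt lies between icmp_seq=53 and icmp_seq=151 (1774185126.824966 - 1774185116.630529 ≈ 10.194 s), which matches the reported value.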
jira-ticket:
https://redhat.atlassian.net/browse/CNV-80581